Week 3 Project (DS as a Field)

Packages:

Packages were installed locally and loaded below:

library(tidyverse)
library(lubridate)
library(hms)
library(ggplot2)
library(dplyr)

Importing Dataset

This dataset titled: NYPD Shooting Incident Data, is from the open data portal. We read it into R using ‘read.csv()’ in a reproducible manner, and examine it using head().

df<-read.csv("https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv")
head(df)

##   INCIDENT_KEY OCCUR_DATE OCCUR_TIME     BORO LOC_OF_OCCUR_DESC PRECINCT
## 1    231974218 08/09/2021   01:06:00    BRONX                         40
## 2    177934247 04/07/2018   19:48:00 BROOKLYN                         79
## 3    255028563 12/02/2022   22:57:00    BRONX           OUTSIDE       47
## 4     25384540 11/19/2006   01:50:00 BROOKLYN                         66
## 5     72616285 05/09/2010   01:58:00    BRONX                         46
## 6     85875439 07/22/2012   21:35:00    BRONX                         42
##   JURISDICTION_CODE LOC_CLASSFCTN_DESC             LOCATION_DESC
## 1                 0                                             
## 2                 0                                             
## 3                 0             STREET            GROCERY/BODEGA
## 4                 0                                    PVT HOUSE
## 5                 0                      MULTI DWELL - APT BUILD
## 6                 2                    MULTI DWELL - PUBLIC HOUS
##   STATISTICAL_MURDER_FLAG PERP_AGE_GROUP PERP_SEX      PERP_RACE VIC_AGE_GROUP
## 1                   false                                                18-24
## 2                    true          25-44        M WHITE HISPANIC         25-44
## 3                   false         (null)   (null)         (null)         25-44
## 4                    true        UNKNOWN        U        UNKNOWN         18-24
## 5                    true          25-44        M          BLACK           <18
## 6                   false          18-24        M          BLACK         18-24
##   VIC_SEX VIC_RACE              X_COORD_CD             Y_COORD_CD Latitude
## 1       M    BLACK                 1006343                 234270 40.80967
## 2       M    BLACK 1000082.937500000000000 189064.671875000000000 40.68561
## 3       M    BLACK                 1020691                 257125 40.87235
## 4       M    BLACK  985107.312500000000000 173349.796875000000000 40.64249
## 5       F    BLACK 1009853.500000000000000 247502.562500000000000 40.84598
## 6       M    BLACK 1011046.687500000000000 239814.234375000000000 40.82488
##   Longitude                                       Lon_Lat
## 1 -73.92019  POINT (-73.92019278899994 40.80967347200004)
## 2 -73.94291 POINT (-73.94291302299996 40.685609672000055)
## 3 -73.86823                  POINT (-73.868233 40.872349)
## 4 -73.99691 POINT (-73.99691224999998 40.642489932000046)
## 5 -73.90746  POINT (-73.90746098599993 40.84598358900007)
## 6 -73.90318  POINT (-73.90317908399999 40.82487781900005)

Cleaning

Below we change date and time types, convert boolean columns to numerical, removing columns we will not use, and making sure null values are in the same format across all columns. We will decide how to use nulls during our analysis, depending on which question we are trying to answer. We only used non null age data when creating visualization 1, and there were no nulls for visualization 2.

df <- df %>%      #start of pipechain from tidyverse package - updates df with the following pipechain changes:
  
  mutate(OCCUR_DATE = mdy(OCCUR_DATE)) %>%  #changing to type date; the symbol %>% is like saying "then", telling the pipechain what to do next
  mutate(OCCUR_TIME = as_hms(OCCUR_TIME)) %>%  #changing type to time using HMS from lubridate HH:MM:SS
  mutate(STATISTICAL_MURDER_FLAG = as.integer(as.logical(STATISTICAL_MURDER_FLAG))) %>% #changes true/false to 1/0
  
  select(-c(INCIDENT_KEY,Y_COORD_CD,X_COORD_CD,Lon_Lat,Longitude,Latitude,LOC_CLASSFCTN_DESC,JURISDICTION_CODE)) %>% #removing columns we won't
  
  mutate(across(where(is.character), ~ na_if(.x, "null"))) %>% #changes all null text to real null
  mutate(across(where(is.character), ~ na_if(.x, "NA"))) %>% #changes all NA text to real null
  mutate(across(where(is.character), ~ na_if(.x, ""))) #changes all blank text to real null
  
  colSums(is.na(df)) #checking null values in each col

##              OCCUR_DATE              OCCUR_TIME                    BORO 
##                       0                       0                       0 
##       LOC_OF_OCCUR_DESC                PRECINCT           LOCATION_DESC 
##                   25596                       0                   14977 
## STATISTICAL_MURDER_FLAG          PERP_AGE_GROUP                PERP_SEX 
##                       0                    9344                    9310 
##               PERP_RACE           VIC_AGE_GROUP                 VIC_SEX 
##                    9310                       0                       0 
##                VIC_RACE 
##                       0

  summary(df) #print summary as requested in project instructions

##    OCCUR_DATE          OCCUR_TIME           BORO           LOC_OF_OCCUR_DESC 
##  Min.   :2006-01-01   Length:29744      Length:29744       Length:29744      
##  1st Qu.:2009-10-29   Class1:hms        Class :character   Class :character  
##  Median :2014-03-25   Class2:difftime   Mode  :character   Mode  :character  
##  Mean   :2014-10-31   Mode  :numeric                                         
##  3rd Qu.:2020-06-29                                                          
##  Max.   :2024-12-31                                                          
##     PRECINCT      LOCATION_DESC      STATISTICAL_MURDER_FLAG PERP_AGE_GROUP    
##  Min.   :  1.00   Length:29744       Min.   :0.0000          Length:29744      
##  1st Qu.: 44.00   Class :character   1st Qu.:0.0000          Class :character  
##  Median : 67.00   Mode  :character   Median :0.0000          Mode  :character  
##  Mean   : 65.23                      Mean   :0.1938                            
##  3rd Qu.: 81.00                      3rd Qu.:0.0000                            
##  Max.   :123.00                      Max.   :1.0000                            
##    PERP_SEX          PERP_RACE         VIC_AGE_GROUP        VIC_SEX         
##  Length:29744       Length:29744       Length:29744       Length:29744      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    VIC_RACE        
##  Length:29744      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Visualizations:

We are requested to add at least two visualizations and some analysis to our data.

df %>%
  
  filter(
    !PERP_AGE_GROUP %in% c("1020", "1028", "223","224", "940", "UNKNOWN", "(null)"),
    !is.na(PERP_AGE_GROUP)
  ) %>%
  
  count(PERP_AGE_GROUP) %>%
  
  mutate(
    pct = n / sum(n),
    label = paste0(PERP_AGE_GROUP, " (", round(pct * 100, 1), "%)")
  ) %>%
  
  ggplot(aes(x = "", y = pct, fill = label)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Perpetrator Age Group Distribution (%)", fill = "Age Group") +
  theme_void()

Analysis for Visualization 1:

From our first visualization, We see that the vast majority of incidents (that include a perp’s age) belong to the 18-44 age group. This age group accounts for about 83.6% of incidents. Some questions this raises to me, is that since the 45-64 age group percentage is so low, at only 4.7%, but the 25-44 age group is 40.5%, what is the real cut off in the 25-44 age group where the incidents start to become less? 35? 40? Better age data is required to see.

Here is the next visualization:

df %>%
  count(BORO) %>%
  ggplot(aes(x = reorder(BORO, -n), y = n, fill = BORO)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.5) +  #add the labels
  labs(title = "Incident Count per Borough", x = "Borough", y = "Number of Incidents") +
  theme_minimal() +
  theme(legend.position = "none")

Analysis for Visualization 2:

In the second visual, we compare the number of incidents throughout the 5 boroughs. Brooklyn is the borough with the most incidents, followed by the Bronx, Queens, Manhattan, and Staten Island. Indicating that Staten Island has the least shooting, and Brooklyn and the Bronx has the most shooting incidents. What factors makes Brooklyn so much worse?

Model

We will look at a logistic categorical model showing how the time of day affects the likelihood of a shooting incident resulting in a murder. (Statistical_murder_flag)

df <- df %>%
    filter(!is.na(STATISTICAL_MURDER_FLAG)) %>%
  mutate(hour = as.factor(substr(OCCUR_TIME, 1, 2)))

model <- glm(
  STATISTICAL_MURDER_FLAG ~ hour,
  data = df,
  family = binomial
)

summary(model)

## 
## Call:
## glm(formula = STATISTICAL_MURDER_FLAG ~ hour, family = binomial, 
##     data = df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.52699    0.05403 -28.264  < 2e-16 ***
## hour01      -0.02095    0.07769  -0.270  0.78742    
## hour02      -0.02063    0.08073  -0.256  0.79829    
## hour03      -0.04470    0.08358  -0.535  0.59274    
## hour04       0.16651    0.08319   2.002  0.04534 *  
## hour05       0.38284    0.10004   3.827  0.00013 ***
## hour06       0.38232    0.12742   3.001  0.00270 ** 
## hour07       0.69465    0.14682   4.731 2.23e-06 ***
## hour08       0.48722    0.14918   3.266  0.00109 ** 
## hour09       0.33153    0.15810   2.097  0.03600 *  
## hour10       0.36776    0.13907   2.644  0.00818 ** 
## hour11       0.38490    0.12443   3.093  0.00198 ** 
## hour12       0.24880    0.11575   2.149  0.03160 *  
## hour13       0.29313    0.10874   2.696  0.00702 ** 
## hour14       0.23627    0.09777   2.416  0.01567 *  
## hour15      -0.03711    0.09894  -0.375  0.70759    
## hour16      -0.13743    0.09686  -1.419  0.15597    
## hour17       0.23096    0.08929   2.587  0.00969 ** 
## hour18       0.27194    0.08397   3.239  0.00120 ** 
## hour19       0.06647    0.08342   0.797  0.42552    
## hour20      -0.05346    0.08214  -0.651  0.51518    
## hour21       0.11324    0.07658   1.479  0.13924    
## hour22       0.19672    0.07412   2.654  0.00795 ** 
## hour23      -0.05737    0.07608  -0.754  0.45080    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 29251  on 29743  degrees of freedom
## Residual deviance: 29128  on 29720  degrees of freedom
## AIC: 29176
## 
## Number of Fisher Scoring iterations: 4

Model Analysis

The coefficients show which hours of the day are more associated with murder compared to hour 0 (midnight). Hours 5, 7, 8, and 11 show the largest positive coefficients with significantly low p values. Murders are most likely from 5am to 2pm based on this data, which was a surprise, as I expected most deadly shootings to be late at night.

Conclusion / Biases

This was a short report where we only looked at 2 visualizations for the data. We found out that Brooklyn has the most shooting incidents, and young men account for the large majority of these incidents (gender was not included in these visualizations but it is mostly male perpetrators).

Potential biases may be, I myself am a male within the prevalent age group, yet my friends and I do not see any thing like what was reported. Then again, I do not live within the test area, which could be another potential bias. I mitigate that by realizing that different geographical locations may have different cultures or crime rates.